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Abstract 

A central problem in machine learning involves 
modeling complex data-sets using highly flexi¬ 
ble families of probability distributions in which 
learning, sampling, inference, and evaluation 
are still analytically or computationally tractable. 

Here, we develop an approach that simultane¬ 
ously achieves both flexibility and tractability. 

The essential idea, inspired by non-equilibrium 
statistical physics, is to systematically and slowly 
destroy structure in a data distribution through 
an iterative forward diffusion process. We then 
learn a reverse diffusion process that restores 
structure in data, yielding a highly flexible and 
tractable generative model of the data. This ap¬ 
proach allows us to rapidly learn, sample from, 
and evaluate probabilities in deep generative 
models with thousands of layers or time steps, 
as well as to compute conditional and posterior 
probabilities under the learned model. We addi¬ 
tionally release an open source reference imple¬ 
mentation of the algorithm. 

1. Introduction 

Historically, probabilistic models suffer from a tradeoff be¬ 
tween two conflicting objectives: tractability and flexibil¬ 
ity. Models that are tractable can be analytically evaluated 
and easily fit to data (e.g. a Gaussian or Laplace). However, 
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these models are unable to aptly describe structure in rich 
datasets. On the other hand, models that Sive flexible can be 
molded to fit structure in arbitrary data. For example, we 
can deflne models in terms of any (non-negative) function 
0(x) yielding the flexible distribution p (x) = where 
Z is a normalization constant. However, computing this 
normalization constant is generally intractable. Evaluating, 
training, or drawing samples from such flexible models typ¬ 
ically requires a very expensive Monte Carlo process. 

A variety of analytic approximations exist which amelio¬ 
rate, but do not remove, this tradeoff-for instance mean 
held theory and its expansions (T, 1982; Tanaka, 1998), 
variational Bayes (Jordan et al., 1999), contrastive diver¬ 
gence (Welling & Hinton, 2002; Hinton, 2002), minimum 
probability flow (Sohl-Dickstein et al., 2011b;a), minimum 
KL contraction (Lyu, 2011), proper scoring rules (Gneit- 
ing & Raftery, 2007; Parry et al., 2012), score matching 
(Hyvarinen, 2005), pseudolikelihood (Besag, 1975), loopy 
belief propagation (Murphy et al., 1999), and many, many 
more. Non-parametric methods (Gershman & Blei, 2012) 
can also be very effective^. 

1.1. Diffusion probabilistic models 

We present a novel way to deflne probabilistic models that 
allows: 

1. extreme flexibility in model structure, 

2. exact sampling, 

^Non-parametric methods can be seen as transitioning 
smoothly between tractable and flexible models. For instance, 
a non-parametric Gaussian mixture model will represent a small 
amount of data using a single Gaussian, but may represent infinite 
data as a mixture of an infinite number of Gaussians. 
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3. easy multiplication with other distributions, e.g. in or¬ 
der to compute a posterior, and 

4. the model log likelihood, and the probability of indi¬ 
vidual states, to be cheaply evaluated. 

Our method uses a Markov chain to gradually convert one 
distribution into another, an idea used in non-equilibrium 
statistical physics (Jarzynski, 1997) and sequential Monte 
Carlo (Neal, 2001). We build a generative Markov chain 
which converts a simple known distribution (e.g. a Gaus¬ 
sian) into a target (data) distribution using a diffusion pro¬ 
cess. Rather than use this Markov chain to approximately 
evaluate a model which has been otherwise defined, we ex¬ 
plicitly define the probabilistic model as the endpoint of the 
Markov chain. Since each step in the diffusion chain has an 
analytically evaluable probability, the full chain can also be 
analytically evaluated. 

Learning in this framework involves estimating small per¬ 
turbations to a diffusion process. Estimating small pertur¬ 
bations is more tractable than explicitly describing the full 
distribution with a single, non-analytically-normalizable, 
potential function. Furthermore, since a diffusion process 
exists for any smooth target distribution, this method can 
capture data distributions of arbitrary form. 

We demonstrate the utility of these diffusion probabilistic 
models by training high log likelihood models for a two- 
dimensional Swiss roll, binary sequence, handwritten digit 
(MNIST), and several natural image (CIFAR-10, bark, and 
dead leaves) datasets. 

1.2. Relationship to other work 

The wake-sleep algorithm (Hinton, 1995; Dayan et al., 
1995) introduced the idea of training inference and gen¬ 
erative probabilistic models against each other. This 
approach remained largely unexplored for nearly two 
decades, though with some exceptions (Sminchisescu et al., 
2006; Kavukcuoglu et al., 2010). There has been a re¬ 
cent explosion of work developing this idea. In (Kingma 
& Welling, 2013; Gregor et al., 2013; Rezende et al., 2014; 
Ozair & Bengio, 2014) variational learning and inference 
algorithms were developed which allow a fiexible genera¬ 
tive model and posterior distribution over latent variables 
to be directly trained against each other. 

The variational bound in these papers is similar to the one 
used in our training objective and in the earlier work of 
(Sminchisescu et al., 2006). However, our motivation and 
model form are both quite different, and the present work 
retains the following differences and advantages relative to 
these techniques: 

1. We develop our framework using ideas from physics, 
quasi-static processes, and annealed importance sam¬ 
pling rather than from variational Bayesian methods. 


2. We show how to easily multiply the learned distribu¬ 
tion with another probability distribution (eg with a 
conditional distribution in order to compute a poste¬ 
rior) 

3. We address the difficulty that training the inference 
model can prove particularly challenging in varia¬ 
tional inference methods, due to the asymmetry in the 
objective between the inference and generative mod¬ 
els. We restrict the forward (inference) process to a 
simple functional form, in such a way that the re¬ 
verse (generative) process will have the same func¬ 
tional form. 

4. We train models with thousands of layers (or time 
steps), rather than only a handful of layers. 

5. We provide upper and lower bounds on the entropy 
production in each layer (or time step) 

There are a number of related techniques for training prob¬ 
abilistic models (summarized below) that develop highly 
fiexible forms for generative models, train stochastic tra¬ 
jectories, or learn the reversal of a Bayesian network. 
Reweighted wake-sleep (Bornschein & Bengio, 2015) de¬ 
velops extensions and improved learning rules for the orig¬ 
inal wake-sleep algorithm. Generative stochastic networks 
(Bengio & Thibodeau-Laufer, 2013; Yao et al., 2014) train 
a Markov kernel to match its equilibrium distribution to 
the data distribution. Neural autoregressive distribution 
estimators (Larochelle & Murray, 2011) (and their recur¬ 
rent (Uria et al., 2013a) and deep (Uria et al., 2013b) ex¬ 
tensions) decompose a joint distribution into a sequence 
of tractable conditional distributions over each dimension. 
Adversarial networks (Goodfellow et al., 2014) train a gen¬ 
erative model against a classifier which attempts to dis¬ 
tinguish generated samples from true data. A similar ob¬ 
jective in (Schmidhuber, 1992) learns a two-way map¬ 
ping to a representation with marginally independent units. 
In (Rippel & Adams, 2013; Dinh et al., 2014) bijective 
deterministic maps are learned to a latent representation 
with a simple factorial density function. In (Stuhlmiiller 
et al., 2013) stochastic inverses are learned for Bayesian 
networks. Mixtures of conditional Gaussian scale mix¬ 
tures (MCGSMs) (Theis et al., 2012) describe a dataset 
using Gaussian scale mixtures, with parameters which de¬ 
pend on a sequence of causal neighborhoods. There is 
additionally significant work learning fiexible generative 
mappings from simple latent distributions to data distribu¬ 
tions - early examples including (MacKay, 1995) where 
neural networks are introduced as generative models, and 
(Bishop et al., 1998) where a stochastic manifold mapping 
is learned from a latent space to the data space. We will 
compare experimentally against adversarial networks and 
MCGSMs. 

Related ideas from physics include the Jarzynski equal¬ 
ity (Jarzynski, 1997), known in machine learning as An- 
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Figure 1. The proposed modeling framework trained on 2-d swiss roll data. The top row shows time slices from the forward trajectory 
q . The data distribution (left) undergoes Gaussian diffusion, which gradually transforms it into an identity-covariance Gaus¬ 
sian (right). The middle row shows the corresponding time slices from the trained reverse trajectory p An identity-covariance 

Gaussian (right) undergoes a Gaussian diffusion process with learned mean and covariance functions, and is gradually transformed back 
into the data distribution (left). The bottom row shows the drift term, (x^^^ ? 0 ~ the same reverse diffusion process. 


nealed Importance Sampling (AIS) (Neal, 2001), which 
uses a Markov chain which slowly converts one distribu¬ 
tion into another to compute a ratio of normalizing con¬ 
stants. In (Burda et al., 2014) it is shown that AIS can also 
be performed using the reverse rather than forward trajec¬ 
tory. Langevin dynamics (Langevin, 1908), which are the 
stochastic realization of the Fokker-Planck equation, show 
how to define a Gaussian diffusion process which has any 
target distribution as its equilibrium. In (Suykens & Vande- 
walle, 1995) the Fokker-Planck equation is used to perform 
stochastic optimization. Finally, the Kolmogorov forward 
and backward equations (Feller, 1949) show that for many 
forward diffusion processes, the reverse diffusion processes 
can be described using the same functional form. 

2. Algorithm 

Our goal is to define a forward (or inference) diffusion pro¬ 
cess which converts any complex data distribution into a 
simple, tractable, distribution, and then learn a finite-time 
reversal of this diffusion process which defines our gener¬ 
ative model distribution (See Figure 1). We first describe 
the forward, inference diffusion process. We then show 


how the reverse, generative diffusion process can be trained 
and used to evaluate probabilities. We also derive entropy 
bounds for the reverse process, and show how the learned 
distributions can be multiplied by any second distribution 
(e.g. as would be done to compute a posterior when in¬ 
painting or denoising an image). 

2.1. Forward Trajectory 

We label the data distribution q The data distribu¬ 

tion is gradually converted into a well behaved (analyti¬ 
cally tractable) distribution tt (y) by repeated application 
of a Markov diffusion kernel (y |y'; P) for tt (y), where 
/3 is the diffusion rate, 


T^{y) = J dy'T^{y\y';l3)TT{y') (1) 

q (x(‘)|x(‘-i)) = T, (x(‘)|x(‘-i); a) . (2) 















Deep Unsupervised Learning using Nonequilibrium Thermodynamics 


t = 0 




Figure 2. Binary sequence learning via binomial diffusion. A binomial diffusion model was trained on binary ‘heartbeat’ data, where a 
pulse occurs every 5th bin. Generated samples (left) are identical to the training data. The sampling procedure consists of initialization 
at independent binomial noise (right), which is then transformed into the data distribution by a binomial diffusion process, with trained 
bit flip probabilities. Each row contains an independent sample. For ease of visualization, all samples have been shifted so that a pulse 
occurs in the first column. In the raw sequence data, the first pulse is uniformly distributed over the first five bins. 
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Figure 3. The proposed framework trained on the CIFAR-10 (Krizhevsky & Hinton, 2009) dataset, (a) Example holdout data (similar 
to training data), (b) Holdout data corrupted with Gaussian noise of variance 1 (SNR = 1). (c) Denoised images, generated by sampling 
from the posterior distribution over denoised images conditioned on the images in (b). (d) Samples generated by the diffusion model. 


The forward trajectory, corresponding to starting at the data 
distribution and performing T steps of diffusion, is thus 


(x(°-^)) = q (x(°)) H q (x(‘) |x(*-i)) (3) 


For the experiments shown below, q corre¬ 

sponds to either Gaussian diffusion into a Gaussian distri¬ 
bution with identity-covariance, or binomial diffusion into 
an independent binomial distribution. Table App.l gives 
the diffusion kernels for both Gaussian and binomial distri¬ 
butions. 
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2.2. Reverse Trajectory 

The generative distribution will be trained to describe the 


same trajectory, but in reverse. 



p = 'tt 

1 

T 

(4) 

p(x(0-^)) =p(x(^)) 


(5) 


t=i 


This can be evaluated rapidly by averaging over samples 
from the forward trajectory q . For infinites¬ 

imal [3 the forward and reverse distribution over trajecto¬ 
ries can be made identical (see Section 2.2). If they are 
identical then only a single sample from q 
is required to exactly evaluate the above integral, as can 
be seen by substitution. This corresponds to the case of a 
quasi-static process in statistical physics (Spinney & Ford, 
2013; Jarzynski, 2011). 


For both Gaussian and binomial diffusion, for continuous 
diffusion (limit of small step size P) the reversal of the 
diffusion process has the identical functional form as the 
forward process (Feller, 1949). Since q (x^^^ is a 

Gaussian (binomial) distribution, and if pt is small, then 
q |x^^^) will also be a Gaussian (binomial) distribu¬ 

tion. The longer the trajectory the smaller the diffusion rate 
P can be made. 

During learning only the mean and covariance for a Gaus¬ 
sian diffusion kernel, or the bit flip probability for a bi¬ 
nomial kernel, need be estimated. As shown in Table 
App.l, f^ and f^ are functions defining 

the mean and covariance of the reverse Markov transitions 
for a Gaussian, and f^ (x^^^, f) is a function providing the 
bit flip probability for a binomial distribution. The compu¬ 
tational cost of running this algorithm is the cost of these 
functions, times the number of time-steps. For all results in 
this paper, multi-layer perceptrons are used to define these 
functions. A wide range of regression or function fitting 
techniques would be applicable however, including nonpa- 
rameteric methods. 


2.4. Training 


Training amounts to maximizing the model log likelihood. 


L = j dyS°'>q logp 

= I (xW) . 


log 




■T) 




( 10 ) 


( 11 ) 


which has a lower hound provided by Jensen’s inequality, 


L> I (x(°-^)) ■ 


log 


p (.<->) n 


q (x(^) |x(^“^)) 


( 12 ) 


As described in Appendix B, for our diffusion trajectories 
this reduces to. 


2.3. Model Probability 

The probability the generative model assigns to the data is 

P = J (x(°-^)) . ( 6 ) 

Naively this integral is intractable - but taking a cue from 
annealed importance sampling and the Jarzynski equality, 
we instead evaluate the relative probability of the forward 
and reverse trajectories, averaged over forward trajectories. 



(x(°-^) 


\ ^ 
)q{. 


AO) 


x(i-r)|x(0)) 


') 


p (x^ 






H"' 9n,(xo)|xo-i))- 


( 8 ) 


(9) 


L>K 

T 


(13) 


K = -'^ /'dx^o^dxWg (^x(°),x0)) 

t=2 

Dkl (q ( 


x(t-i)|x(t),x( 0 ) 


) p(x0-i)|x0))) 

+ Hg - Hg (^X(i)|X(°)) - Hp . 

(14) 


where the entropies and KL divergences can be analyt¬ 
ically computed. The derivation of this bound parallels 
the derivation of the log likelihood bound in variational 
Bayesian methods. 

As in Section 2.3 if the forward and reverse trajectories are 
identical, corresponding to a quasi-static process, then the 
inequality in Equation 13 becomes an equality. 

Training consists of finding the reverse Markov transitions 
which maximize this lower bound on the log likelihood, 

^= argmax K. (15) 

p(x(*“l) |x(*) ) 
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The specific targets of estimation for Gaussian and bino- to multiply distributions in the context of diffusion proba- 
mial diffusion are given in Table App. 1 . bilistic models. 


Thus, the task of estimating a probability distribution has 
been reduced to the task of performing regression on the 
functions which set the mean and covariance of a sequence 
of Gaussians (or set the state flip probability for a sequence 
of Bernoulli trials). 

2.4.1. Setting the Dieeusion Rate f3t 

The choice of /3t in the forward trajectory is important for 
the performance of the trained model. In AIS, the right 
schedule of intermediate distributions can greatly improve 
the accuracy of the log partition function estimate (Grosse 
et al., 2013). In thermodynamics the schedule taken when 
moving between equilibrium distributions determines how 
much free energy is lost (Spinney & Ford, 2013; Jarzynski, 
2011 ). 

In the case of Gaussian diffusion, we learn^ the forward 
diffusion schedule P 2 ---T by gradient ascent on K. The 
variance I3i of the first step is fixed to a small constant 
to prevent overfitting. The dependence of samples from 
^ on Pi...T is made explicit by using ‘frozen 

noise’ - as in (Kingma & Welling, 2013) the noise is treated 
as an additional auxiliary variable, and held constant while 
computing partial derivatives of K with respect to the pa¬ 
rameters. 

For binomial diffusion, the discrete state space makes gra¬ 
dient ascent with frozen noise impossible. We instead 
choose the forward diffusion schedule Pi...t to erase a con¬ 
stant fraction ^ of the original signal per diffusion step, 
yielding a diffusion rate of Pt = {T — t . 

2.5. Multiplying Distributions, and Computing 
Posteriors 

Tasks such as computing a posterior in order to do signal 
denoising or inference of missing values requires multipli¬ 
cation of the model distribution p with a second dis¬ 

tribution, or bounded positive function, r ), producing 
a new distribution p oc p r (x^^^). 

Multiplying distributions is costly and difficult for many 
techniques, including variational autoencoders, GSNs, 
NADEs, and most graphical models. However, under a dif¬ 
fusion model it is straightforward, since the second distri¬ 
bution can be treated either as a small perturbation to each 
step in the diffusion process, or often exactly multiplied 
into each diffusion step. Figures 3 and 5 demonstrate the 
use of a diffusion model to perform denoising and inpaint¬ 
ing of natural images. The following sections describe how 

^Recent experiments suggest that it is just as effective to in¬ 
stead use the same fixed Pt schedule as for binomial diffusion. 


2.5.1. Modified Marginal Distributions 

First, in order to compute p (x^^^), we multiply each of 
the intermediate distributions by a corresponding function 
r (x^^^). We use a tilde above a distribution or Markov 
transition to denote that it belongs to a trajectory that has 
been modified in this way. p is the modified re¬ 
verse trajectory, which starts at the distribution p = 

-^p (x^^^) r (x^^^) and proceeds through the sequence of 
intermediate distributions 

#(x«l) = ip(xC>).(x<<)), (16) 

where Zf is the normalizing constant for the tth intermedi¬ 
ate distribution. 


2.5.2. Modified Diffusion Steps 

The Markov kernel p (x^^^ | for the reverse diffu¬ 

sion process obeys the equilibrium condition 

p (x(*) = I (x‘) I x(*+i)) p (x*+i)) . (17) 

We wish the perturbed Markov kernel p (x^^^ | to 

instead obey the equilibrium condition for the perturbed 
distribution, 

P (x“>) 

p (x(*)) r (x^*)) 




( 20 ) 


= J c(x(*+^)p ( 


x(0 I 


x(0 I 


(18) 

)■ 




(19) 




J dx(*+i)p (x(‘) I x(‘+i)) • 

7. .-.r \ J 


Equation 20 will be satisfied if 



Zt+ir (xO)) 
ZfV (x(^+^)) 


( 21 ) 


Equation 21 may not correspond to a normalized proba¬ 
bility distribution, so we choose p (x^^^ to be the 

corresponding normalized distribution 

( 22 ) 
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Figure 4. The proposed framework trained on dead leaf images (Jeulin, 1997; Lee et aL, 2001). fa) Example training image, (b) A sample 
from the previous state of the art natural image model (Theis et al., 2012) trained on identical data, reproduced here with permission. 
(c) A sample generated by the diffusion model. Note that it demonstrates fairly consistent occlusion relationships, displays a multiscale 
distribution over object sizes, and produces circle-like objects, especially at smaller scales. As shown in Table 2, the diffusion model has 
the highest log likelihood on the test set. 


where Zt is the normalization constant. 


For a Gaussian, each diffusion step is typically very sharply 
peaked relative to r ), due to its small variance. This 


means that 




can be treated as a small perturbation 


to p (x^^^ . A small perturbation to a Gaussian ef¬ 

fects the mean, but not the normalization constant, so in 
this case Equations 21 and 22 are equivalent (see Appendix 
C). 


2.5.3. Applying r (x^^^) 

If r (x^^)) is sufficiently smooth, then it can be treated 
as a small perturbation to the reverse diffusion kernel 
p (x^^^ In this case p (x^^^ will have an 

identical functional form to p but with per¬ 

turbed mean for the Gaussian kernel, or with perturbed flip 
rate for the binomial kernel. The perturbed diffusion ker¬ 
nels are given in Table App. 1, and are derived for the Gaus¬ 
sian in Appendix C. 

If r can be multiplied with a Gaussian (or binomial) 

distribution in closed form, then it can be directly multi¬ 
plied with the reverse diffusion kernel p (x^^^ in 

closed form. This applies in the case where r (x^^^) con¬ 
sists of a delta function for some subset of coordinates, as 
in the inpainting example in Figure 5. 

2.5.4. Choosing r 

Typically, r (x^^^) should be chosen to change slowly over 
the course of the trajectory. For the experiments in this 
paper we chose it to be constant, 


r (x(*)) = r (x(°)) . (23) 


Another convenient choice is r (x^^^) = r (x^^^) ^ . Un¬ 
der this second choice r (x^^^) makes no contribution to the 
starting distribution for the reverse trajectory. This guaran¬ 
tees that drawing the initial sample from p for the 

reverse trajectory remains straightforward. 

2.6. Entropy of Reverse Process 

Since the forward process is known, we can derive upper 
and lower bounds on the conditional entropy of each step 
in the reverse trajectory, and thus on the log likelihood, 

Hq +Hq (X(*-1)|X(°)) -Hq (^X^*) |X(°)) 

< Hq (X(‘-1)|X(*)) < Hq (X(*)|X(*-1)) , 

(24) 

where both the upper and lower bounds depend only on 
q |x*^^^), and can be analytically computed. The 

derivation is provided in Appendix A. 


3. Experiments 

We train diffusion probabilistic models on a variety of con¬ 
tinuous datasets, and a binary dataset. We then demonstrate 
sampling from the trained model and inpainting of miss¬ 
ing data, and compare model performance against other 
techniques. In all cases the objective function and gradi¬ 
ent were computed using Theano (Bergstra & Breuleux, 
2010). Model training was with SFO (Sohl-Dickstein et al., 
2014), except for CIFAR-10. CIFAR-10 results used the 

^ An earlier version of this paper reported higher log likeli¬ 
hood bounds on CIFAR-10. These were the result of the model 
learning the 8-bit quantization of pixel values in the CIFAR-10 
dataset. The log likelihood bounds reported here are instead for 
data that has been pre-processed by adding uniform noise to re¬ 
move pixel quantization, as recommended in (Theis et al., 2015). 
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Figure 5. Inpainting, (a) A bark image from (Lazebnik et al., 2005). (b) The same image with the central lOOx 100 pixel region replaced 
with isotropic Gaussian noise. This is the initialization p for the reverse trajectory, (c) The central lOOx 100 region has been 

inpainted using a diffusion probabilistic model trained on images of bark, by sampling from the posterior distribution over the missing 
region conditioned on the rest of the image. Note the long-range spatial structure, for instance in the crack entering on the left side of the 
inpainted region. The sample from the posterior was generated as described in Section 2.5, where r was set to a delta function 

for known data, and a constant for missing data. 


Dataset 

K 

^ Lnull 

Swiss Roll 

2.35 bits 

6.45 bits 

Binary Heartbeat 

-2.414 bits/seq. 

12.024 bits/seq. 

Bark 

-0.55 bits/pixel 

1.5 bits/pixel 

Dead Leaves 

1.489 bits/pixel 

3.536 bits/pixel 

CIFAR-lO^ 

5.4 ± 0.2 bits/pixel 

11.5 ± 0.2 bits/pixel 

MNIST 

See table 2 


Table 1 . The lower bound K on the log likelihood, computed on a 
holdout set, for each of the trained models. See Equation 12. The 
right column is the improvement relative to an isotropic Gaussian 
or independent binomial distribution. Lnuii is the log likelihood 


of TT I 


All datasets except for Binary Heartbeat were scaled 
by a constant to give them variance 1 before computing log like¬ 
lihood. 


Model 

Log Likelihood 

Dead Leaves 


MCGSM 

1.244 bits/pixel 

Diffusion 

1.489 bits/pixel 

MNIST 


Stacked CAE 

174 ±2.3 bits 

DBN 

199 ±2.9 bits 

Deep GSN 

309 ± 1.6 bits 

Diffusion 

317 ±2.7 bits 

Adversarial net 

325 ±2.9 bits 

Perfect model 

349 ± 3.3 bits 


Table 2. Log likelihood comparisons to other algorithms. Dead 
leaves images were evaluated using identical training and test data 
as in (Theis et al., 2012). MNIST log likelihoods were estimated 
using the Parzen-window code from (Goodfellow et al., 2014), 
with values given in bits, and show that our performance is com¬ 
parable to other recent techniques. The perfect model entry was 
computed by applying the Parzen code to samples from the train¬ 
ing data. 


open source implementation of the algorithm, and RM- 
Sprop for optimization. The lower bound on the log like¬ 
lihood provided by our model is reported for all datasets 
in Table 1. A reference implementation of the algorithm 
utilizing Blocks (van Merrienboer et al., 2015) is avail¬ 
able at https://github.com/Sohl-Dickstein/ 
Diffusion-Probabilistic-Models. 

3.1. Toy Problems 

3.1.1. Swiss Roll 

A diffusion probabilistic model was built of a two dimen¬ 
sional Swiss roll distribution, using a radial basis function 
network to generate , t) and , f). As illus¬ 

trated in Figure 1, the swiss roll distribution was success¬ 
fully learned. See Appendix Section D. 1.1 for more details. 


3.1.2. Binary Heartbeat Distribution 

A diffusion probabilistic model was trained on simple bi¬ 
nary sequences of length 20, where a 1 occurs every 5th 
time bin, and the remainder of the bins are 0, using a multi¬ 
layer perceptron to generate the Bernoulli rates , t) 

of the reverse trajectory. The log likelihood under the true 
distribution is log 2 (^) = —2.322 bits per sequence. As 
can be seen in Figure 2 and Table 1 learning was nearly 
perfect. See Appendix Section D.1.2 for more details. 

3.2. Images 

We trained Gaussian diffusion probabilistic models on sev¬ 
eral image datasets. The multi-scale convolutional archi- 
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lecture shared by these experiments is described in Ap¬ 
pendix Section D.2.1, and illustrated in Figure D.l. 

3.2.1. Datasets 

MNIST In order to allow a direct comparison against 
previous work on a simple dataset, we trained on MNIST 
digits (LeCun & Cortes, 1998). Log likelihoods relative to 
(Bengio et al., 2012; Bengio & Thibodeau-Laufer, 2013; 
Goodfellow et al., 2014) are given in Table 2. Samples 
from the MNIST model are given in Appendix Figure 
App.l. Our training algorithm provides an asymptotically 
consistent lower bound on the log likelihood. However 
most previous reported results on continuous MNIST log 
likelihood rely on Parzen-window based estimates com¬ 
puted from model samples. For this comparison we there¬ 
fore estimate MNIST log likelihood using the Parzen- 
window code released with (Goodfellow et al., 2014). 

CIFAR-10 A probabilistic model was fit to the training 
images for the CIFAR-10 challenge dataset (Krizhevsky & 
Hinton, 2009). Samples from the trained model are pro¬ 
vided in Figure 3. 

Dead Leaf Images Dead leaf images (Jeulin, 1997; Lee 
et al., 2001) consist of layered occluding circles, drawn 
from a power law distribution over scales. They have an an¬ 
alytically tractable structure, but capture many of the statis¬ 
tical complexities of natural images, and therefore provide 
a compelling test case for natural image models. As illus¬ 
trated in Table 2 and Figure 4, we achieve state of the art 
performance on the dead leaves dataset. 

Bark Texture Images A probabilistic model was trained 
on bark texture images (T01-T04) from (Lazebnik et al., 
2005). For this dataset we demonstrate that it is straightfor¬ 
ward to evaluate or generate from a posterior distribution, 
by inpainting a large region of missing data using a sample 
from the model posterior in Figure 5. 

4. Conclusion 

We have introduced a novel algorithm for modeling proba¬ 
bility distributions that enables exact sampling and evalua¬ 
tion of probabilities and demonstrated its effectiveness on a 
variety of toy and real datasets, including challenging natu¬ 
ral image datasets. For each of these tests we used a similar 
basic algorithm, showing that our method can accurately 
model a wide variety of distributions. Most existing den¬ 
sity estimation techniques must sacrifice modeling power 
in order to stay tractable and efficient, and sampling or 
evaluation are often extremely expensive. The core of our 
algorithm consists of estimating the reversal of a Markov 
diffusion chain which maps data to a noise distribution; as 


the number of steps is made large, the reversal distribution 
of each diffusion step becomes simple and easy to estimate. 
The result is an algorithm that can learn a fit to any data dis¬ 
tribution, but which remains tractable to train, exactly sam¬ 
ple from, and evaluate, and under which it is straightfor¬ 
ward to manipulate conditional and posterior distributions. 
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A. Conditional Entropy Bounds Derivation 

The conditional entropy Hq of a step in the reverse trajectory is 


Hg (X(*-1\X(*)) 

II 

>1 

1 

(25) 

^X(*-i)|xW) +Hg 

= Hg +Hg 

(26) 


1 

7 

+ 

1 

II 

(27) 


An upper bound on the entropy change can be constructed by observing that tt (y) is the maximum entropy distribution. 
This holds without qualification for the binomial distribution, and holds for variance 1 training data for the Gaussian case. 
For the Gaussian case, training data must therefore be scaled to have unit norm for the following equalities to hold. It need 


not be whitened. The upper bound is derived as follows, 

Hg > Hg (28) 

Hg - Hg (x(*)) < 0 (29) 

Hg (x(*“i)|x(*)) < Hg . (30) 

A lower bound on the entropy difference can be established by observing that additional steps in a Markov chain do not 
increase the information available about the initial state in the chain, and thus do not decrease the conditional entropy of 
the initial state, 

Hg (^x(°)|x(*)) > Hg (31) 

Hg - Hg > Hg (^X^|X^^^ + Hg - Hg - Hg (^X^^ (32) 

Hg - Hg > Hg (^X^, X^^^ - Hg (33) 

Hg - Hg > Hg ^X ^^^^ |X^^ - Hg (34) 

Hg > Hg + Hg (^X^^ | X^ - Hg . (35) 

Combining these expressions, we bound the conditional entropy for a single step, 

Hg > Hg > Hg (^X|X^^^ +Hg (^X^^|X^^ - Hg , (36) 


where both the upper and lower bounds depend only on the conditional forward trajectory q |x^^^), and can be 

analytically computed. 


B. Log Likelihood Lower Bound 

The lower bound on the log likelihood is 


L>K 

K = I dJO '^'q (x<» -^>) log 


- n 


p (x^* 

q (x(*)|x(*“i)) 


(37) 

(38) 


( 39 ) 
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B.l. Entropy of p 

We can peel off the contribution from p , and rewrite it as an entropy, 


/s: = y 

= j 


t=l 

1 1 

1—1 
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1_1 


f dx(^)gfx(^)' 

' \ J 

1 logp 
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q (x(^) 
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' \ J 

1 logTT (x^) 

(41) 






(42) 


By design, the cross entropy to tt is constant under our diffusion kernels, and equal to the entropy of p 

Therefore, 


K = 


T 

^ f (x(°-^)) log 


q (x(^) |x(^“^)) 



(43) 


B.2. Remove the edge effect at t = 0 

In order to avoid edge effects, we set the final step of the reverse trajectory to be identical to the corresponding forward 
diffusion step. 


p(x(°)|xW) =</(xW|x(°)) 


TT (x^^^) 
TT (x(^)) 


= r, (x(°)|xW;/3i). 


We then use this equivalence to remove the contribution of the first time-step in the sum, 


(44) 


T 

K = '^ f log 

t=2 

T 

= J2[ (x(°-^)) log 

t=2 


p (x('-i)|x(')) 
q (x(^) |x(^“^)) 

q (x(^) |x(^“^)) 


g(x(i)|x(Q))^(x(Q)) ] _ 

g (x(^) |x(^)) TT (x(^)) ^ 

(45) 

Hp (X(^)) , (46) 



where we again used the fact that by design — J dyS^^q (x^^^) log tt (x^^^) = Hp (X^^^) is a constant for all t. 

B.3. Rewrite in terms of posterior q (x(t-i)|x(0)) 

Because the forward trajectory is a Markov process. 


K = 



p(x(*-i)|x(*)) 
q (x(*)|x(*“i),x(°)) 



Using Bayes’ rule we can rewrite this in terms of a posterior and marginals from the forward trajectory. 


T 

^ = E /(x(0 -^)) log 

t=2 


p (x(* g (x*^* 

(/ (x(*“l) |x(*), x(°)) (/ (x(*) |x(°)) 



(47) 


( 48 ) 
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Figure App.L Samples from a diffusion probabilistic model trained on MNIST digits. Note that unlike many MNIST sample figures, 
these are true samples rather than the mean of the Gaussian or binomial distribution from which samples would be drawn. 


B.4. Rewrite in terms of KL divergences and entropies 


We then recognize that several terms are conditional entropies, 


T 

K = '^ f log 

t=2 

T 

= J2[ (x(°-^)) log 

t=2 
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Finally we transform the log ratio of probability distributions into a KL divergence, 

T 


(51) 


t=2 

+ Hq (x(^)|x(°)) - Hq (^X(1)|X(°)) - Hp . 

Note that the entropies can be analytically computed, and the KL divergence can be analytically computed given x^^^ and 

xW. 
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Table App.L The key equations in this paper for the specific cases of Gaussian and binomial diffusion processes. Af {u; p, S) is a Gaussian distribution with mean p and covariance 
S. B (u; r) is the distribution for a single Bernoulli trial, with u = 1 occurring with probability r, and u = 0 occurring with probability 1 — r. Finally, for the perturbed Bernoulli 
trials h\ = (1 — fit) + 0.5y5t, c\ = [f^ t") j , and d\ = r = l) , and the distribution is given for a single bit i. 
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C. Perturbed Gaussian Transition 


We wish to compute p For notational simplicity, let /i = f^ (x^^\ t), S 

Using this notation. 

= fs (xO),t), and y = xO i). 

p (^y 1 x^*^) oc p (^y 1 x(‘)) r (y) 

(52) 

= V (y; p, S) r (y). 

(53) 

We can rewrite this in terms of energy functions, where Er (y) = — log r (y). 


P (y 1 cx exp [-E (y)] 

(54) 

(y) = ^ (y - pf (y - m) + E (y) . 

(55) 


If Er (y) is smooth relative to | (y — (y — //), then we can approximate it using its Taylor expansion around fi. 

One sufficient condition is that the eigenvalues of the Hessian of (y) are everywhere much smaller magnitude than the 
eigenvalues of 11“^. We then have 


Er (y) ^ Er ifi) + (y - m) g 


(56) 


dEr{y') 

where g = 


Plugging this in to the full energy, 
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= ^ (y - M + Sg)^ S ^ (y - M + Sg) + constant. 
This corresponds to a Gaussian, 

p(y |xW) «V(y;M-Sg,S). 
Substituting back in the original formalism, this is, 
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D. Experimental Details 

D.l. Toy Problems 

D.1.1. Swiss Roll 

A probabilistic model was built of a two dimensional swiss 
roll distribution. The generative model p con¬ 

sisted of 40 time steps of Gaussian diffusion initialized 
at an identity-covariance Gaussian distribution. A (nor¬ 
malized) radial basis function network with a single hid¬ 
den layer and 16 hidden units was trained to generate the 
mean and covariance functions f^ , t) and a diago¬ 
nal f^ for the reverse trajectory. The top, read¬ 

out, layer for each function was learned independently for 
each time step, but for all other layers weights were shared 
across all time steps and both functions. The top layer out¬ 
put of fj] (x^^^, t) was passed through a sigmoid to restrict 
it between 0 and 1. As can be seen in Figure 1 , the swiss 
roll distribution was successfully learned. 

D.1.2. Binary Heartbeat Distribution 

A probabilistic model was trained on simple binary se¬ 
quences of length 20, where a 1 occurs every 5th time 
bin, and the remainder of the bins are 0. The generative 
model consisted of 2000 time steps of binomial diffusion 
initialized at an independent binomial distribution with the 
same mean activity as the data (p =1^ = 0.2). A 
multilayer perceptron with sigmoid nonlinearities, 20 in¬ 
put units and three hidden layers with 50 units each was 
trained to generate the Bernoulli rates f^ (x^^^, t) of the re¬ 
verse trajectory. The top, readout, layer was learned inde¬ 
pendently for each time step, but for all other layers weights 
were shared across all time steps. The top layer output was 
passed through a sigmoid to restrict it between 0 and 1. As 
can be seen in Figure 2, the heartbeat distribution was suc¬ 
cessfully learned. The log likelihood under the true gener¬ 
ating process is log 2 (|) = —2.322 bits per sequence. As 
can be seen in Figure 2 and Table 1 learning was nearly 
perfect. 

D.2. Images 

D.2.1. Architecture 

Readout In all cases, a convolutional network was used 
to produce a vector of outputs G for each image 
pixel i. The entries in are divided into two equal sized 
subsets, y^ and y^. 


Temporal Dependence The convolution output y^ is 
used as per-pixel weighting coefficients in a sum over time- 
dependent “bump” functions, generating an output e IZ 


for each pixel i. 


z 


i 


i=i 


(62) 


The bump functions consist of 


Efe=iexp(-2^(i-rfc)^) 


(63) 


where Tj G (0, T) is the bump center, and w is the spacing 
between bump centers, is generated in an identical way, 
but using . 

For all image experiments a number of timesteps T = 1000 
was used, except for the bark dataset which used T = 500. 


Mean and Variance Finally, these outputs are combined 
to produce a diffusion mean and variance prediction for 
each pixel i, 

(zf + {(it)) , (64) 

= {Xi - zf) (1 - Sii) + . (65) 

where both S and p are parameterized as a perturbation 
around the forward diffusion kernel (x^^^ /3t), 

and is the mean of the equilibrium distribution that 
would result from applying p many times. S 

is restricted to be a diagonal matrix. 


Multi-Scale Convolution We wish to accomplish goals 
that are often achieved with pooling networks - specif¬ 
ically, we wish to discover and make use of long-range 
and multi-scale dependencies in the training data. How¬ 
ever, since the network output is a vector of coefficients 
for every pixel it is important to generate a full resolution 
rather than down-sampled feature map. We therefore define 
multi-scale-convolution layers that consist of the following 
steps: 


1. Perform mean pooling to downsample the image to 
multiple scales. Downsampling is performed in pow¬ 
ers of two. 

2. Performing convolution at each scale. 

3. Upsample all scales to full resolution, and sum the re¬ 
sulting images. 

4. Perform a pointwise nonlinear transformation, con¬ 
sisting of a soft relu (log [1 + exp (•)]). 

The composition of the first three linear operations resem¬ 
bles convolution by a multiscale convolution kernel, up to 
blocking artifacts introduced by upsampling. This method 
of achieving multiscale convolution was described in (Bar¬ 
ron et al., 2013). 




Deep Unsupervised Learning using Nonequilibrium Thermodynamics 


Mean 

image 

Temporal 

coefficients 


X 


X 


i 


N 




f 






Convolution 
1x1 kernel 



X 


Convolution 
1x1 kernel 


Dense 


• • • 


Dense | ~| 

N 

Input ^ 



Covariance 

image 

Temporal 

coefficients 


Multi-scale 

convolution 

Multi-scale 

convolution 


Figure D.L Network architecture for mean function , tj 

and covariance function fs , for experiments in Section 

3.2. The input image x^^^ passes through several layers of multi¬ 
scale convolution (Section D.2.1). It then passes through several 
convolutional layers with 1x1 kernels. This is equivalent to a 
dense transformation performed on each pixel. A linear transfor¬ 
mation generates coefficients for readout of both mean and 
covariance for each pixel. Finally, a time dependent readout 
function converts those coefficients into mean and covariance im¬ 
ages, as described in Section D.2.1. For CIFAR-10 a dense (or 
fully connected) pathway was used in parallel to the multi-scale 
convolutional pathway. For MNIST, the dense pathway was used 
to the exclusion of the multi-scale convolutional pathway. 


Dense Layers Dense (acting on the full image vector) 
and kernel-width-1 convolutional (acting separately on the 
feature vector for each pixel) layers share the same form. 
They consist of a linear transformation, followed by a tanh 
nonlinearity. 

































